Checking and Changing a File's Encoding


2023-06-15 06:35 | Source: compiled from the web

Linux

https://www.shellhacks.com/linux-check-change-file-encoding/

Viewing the encoding

In any directory, run file * to list the type and encoding of every file:

 

$ file *
chucklu.autoend.js: HTML document, UTF-8 Unicode text, with very long lines, with CRLF line terminators
custom.css: UTF-8 Unicode text, with CRLF line terminators
SimpleMemory.css: UTF-8 Unicode text, with CRLF line terminators

 

$ file *
chucklu.autoend.js: HTML document, Little-endian UTF-16 Unicode text, with very long lines, with CRLF line terminators
custom.css: UTF-8 Unicode text, with CRLF line terminators
SimpleMemory.css: UTF-8 Unicode text, with CRLF line terminators

 

$ file -bi chucklu.autoend.js
text/html;

 

$ file -bi custom.css
text/plain; charset=utf-8

 

-b, --brief   Don’t print filename (brief mode)

-i, --mime   Print filetype and encoding

 

$ file -i *
Daily Sales Report_2021_04_09.bad.csv:  application/csv; charset=utf-8
Daily Sales Report_2021_04_09.good.csv: application/csv; charset=utf-16le

$ file *
Daily Sales Report_2021_04_09.bad.csv:  CSV text
Daily Sales Report_2021_04_09.good.csv: CSV text

$ file * --mime-encoding --mime-type
Daily Sales Report_2021_04_09.bad.csv:  application/csv; charset=utf-8
Daily Sales Report_2021_04_09.good.csv: application/csv; charset=utf-16le

 

Changing the encoding

iconv -f utf-16 -t ascii text.txt
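The command above prints the converted text to standard output and leaves text.txt untouched. As a minimal sketch of saving the result instead, the helper below converts a file from a known source encoding to UTF-8. The name convert_to_utf8 and its argument order are assumptions made for illustration; only iconv with -f and -t comes from the original command.

```shell
# convert_to_utf8 is a hypothetical helper, not part of iconv itself.
# Usage: convert_to_utf8 SOURCE_ENCODING INPUT_FILE OUTPUT_FILE
convert_to_utf8() {
    src_encoding=$1
    input=$2
    output=$3
    # iconv writes the converted bytes to stdout. Redirect to a different
    # file: redirecting onto the input would truncate it before it is read.
    iconv -f "$src_encoding" -t UTF-8 "$input" > "$output"
}
```

For example, convert_to_utf8 UTF-16LE text.txt text.utf8.txt would write a UTF-8 copy next to the original. The helper returns the exit status of iconv, so a failed conversion can be detected by the caller.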

 

Windows

https://stackoverflow.com/questions/64860/best-way-to-convert-text-files-between-character-sets

On Windows with PowerShell (Jay Bazuzi):

PS C:\> gc -en utf8 in.txt | Out-File -en ascii out.txt

(No ISO-8859-15 support though; it says that supported charsets are unicode, utf7, utf8, utf32, ascii, bigendianunicode, default, and oem.)

Edit

Do you mean ISO-8859-1 support? Using "String" as the encoding handles this, e.g. for the reverse direction:

gc -en string in.txt | Out-File -en utf8 out.txt

Note: The possible enumeration values are "Unknown, String, Unicode, Byte, BigEndianUnicode, UTF8, UTF7, Ascii".

CsCvt - Kalytta's Character Set Converter is another great command line based conversion tool for Windows.

 

How to detect the encoding of a file?

There is a pretty simple way using Firefox. Open your file using Firefox, then View > Character Encoding.

 

Answer

Files generally indicate their encoding with a file header, such as a byte order mark (BOM). However, even after reading the header you can never be sure what encoding a file is really using.

For example, a file with the first three bytes 0xEF,0xBB,0xBF is probably a UTF-8 encoded file. However, it might be an ISO-8859-1 file which happens to start with the characters ï»¿. Or it might be a different file type entirely.

Notepad++ does its best to guess what encoding a file is using, and most of the time it gets it right. Sometimes it does get it wrong though - that's why that 'Encoding' menu is there, so you can override its best guess.

For the two encodings you mention:

The "UCS-2 Little Endian" files are UTF-16 files (as far as I understand), so they probably start with 0xFF,0xFE as the first 2 bytes. From what I can tell, Notepad++ describes them as "UCS-2" since it doesn't support certain facets of UTF-16.

The "UTF-8 without BOM" files don't have any header bytes. That's what the "without BOM" bit means.
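The header bytes discussed in this answer can be inspected from a shell. The following is a minimal sketch, not a full detector: detect_bom is a hypothetical helper name, and the list of byte order marks covers only the common UTF-8, UTF-16, and UTF-32 cases. As the answer notes, a matching prefix is only a hint, and "none" says nothing about the actual encoding.

```shell
# detect_bom is a hypothetical helper: it prints the encoding suggested by a
# byte order mark at the start of a file, or "none" when no known BOM is found.
detect_bom() {
    # Hex-dump the first 4 bytes as one unspaced lowercase string,
    # e.g. "efbbbf68" for a UTF-8 BOM followed by the letter h.
    first_bytes=$(head -c 4 "$1" | od -An -tx1 | tr -d ' \n')
    case $first_bytes in
        efbbbf*)  echo "UTF-8" ;;
        fffe0000) echo "UTF-32LE" ;;  # checked before UTF-16LE, which shares FF FE
        0000feff) echo "UTF-32BE" ;;
        fffe*)    echo "UTF-16LE" ;;
        feff*)    echo "UTF-16BE" ;;
        *)        echo "none" ;;
    esac
}
```

Calling detect_bom on a file saved by Notepad++ as "UCS-2 Little Endian" would be expected to print UTF-16LE, while a "UTF-8 without BOM" file should print none.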

 

Detecting a file's encoding with UDE

https://www.nuget.org/packages/UDE.CSharp

https://github.com/errepi/ude

public void GetEncoding2(string filePath)
{
    using (FileStream fs = File.OpenRead(filePath))
    {
        // Feed the whole stream to the detector, then signal the end of the data.
        Ude.CharsetDetector cdet = new Ude.CharsetDetector();
        cdet.Feed(fs);
        cdet.DataEnd();
        if (cdet.Charset != null)
        {
            Console.WriteLine("Charset: {0}, confidence: {1}", cdet.Charset, cdet.Confidence);
        }
        else
        {
            Console.WriteLine("Detection failed.");
        }
    }
}

Results from UDE, compared with what file * reports for the same files:

Charset: ASCII, confidence: 1 (file * reports: ASCII text, with CRLF line terminators)
Charset: UTF-8, confidence: 0.7525 (file * reports: UTF-8 Unicode text, with CRLF line terminators)
Charset: gb18030, confidence: 0.99 (file * reports: ISO-8859 text, with CRLF line terminators)

 

Reading the first 4 bytes of a file

public string GetEncoding(string filePath)
{
    var bom = new byte[4];
    using (var file = new FileStream(filePath, FileMode.Open, FileAccess.Read))
    {
        file.Read(bom, 0, 4);
    }
    // Format the bytes as space-separated uppercase hex (Select requires System.Linq).
    var str = string.Join(" ", bom.Select(x => x.ToString("X2")));
    Console.WriteLine($"{str}, {filePath}");
    return str;
}

 

Saving a file as UTF-8 without BOM in C#

filename = "2019-04-23-001.txt";
filePath = Path.Combine(folder, filename);
// new UTF8Encoding(false) means the writer does not emit a BOM.
using (StreamWriter sw = new StreamWriter(File.Open(filePath, FileMode.Create), new UTF8Encoding(false)))
{
    sw.WriteLine("hello");
}

filename = "2019-04-23-002.txt";
filePath = Path.Combine(folder, filename);
using (StreamWriter sw = new StreamWriter(File.Open(filePath, FileMode.Create), new UTF8Encoding(false)))
{
    sw.WriteLine("你好");
}

2019-04-23-001.txt: ASCII text, with CRLF line terminators
2019-04-23-002.txt: UTF-8 Unicode text, with CRLF line terminators

C# does not switch encodings here. When the content contains only ASCII characters, a file saved as UTF-8 without BOM is byte-for-byte identical to an ASCII file, so file reports it as ASCII text.
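Plain ASCII text is already valid UTF-8, which is why a BOM-less UTF-8 file without any non-ASCII characters cannot be told apart from an ASCII file. A rough way to check which case a file falls into is to count its bytes outside the 7-bit range. The sketch below assumes a POSIX shell with tr and wc; count_non_ascii_bytes is a hypothetical helper name.

```shell
# count_non_ascii_bytes is a hypothetical helper: it prints how many bytes of
# a file lie outside the 7-bit ASCII range (0x00-0x7F).
count_non_ascii_bytes() {
    # LC_ALL=C makes tr work on raw bytes rather than multibyte characters.
    # Delete every ASCII byte, then count whatever is left.
    LC_ALL=C tr -d '\000-\177' < "$1" | wc -c | tr -d ' '
}
```

A result of 0 means the file is pure ASCII. For a UTF-8 file containing "你好" the count would be 6, since each of the two characters takes three bytes.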

 

filename = "2019-04-23-003.txt";
filePath = Path.Combine(folder, filename);
using (StreamWriter sw = new StreamWriter(File.Open(filePath, FileMode.Create), Encoding.ASCII))
{
    sw.WriteLine("hello");
}

filename = "2019-04-23-004.txt";
filePath = Path.Combine(folder, filename);
using (StreamWriter sw = new StreamWriter(File.Open(filePath, FileMode.Create), Encoding.ASCII))
{
    sw.WriteLine("你好");
}

2019-04-23-003.txt: ASCII text, with CRLF line terminators
2019-04-23-004.txt: ASCII text, with CRLF line terminators

Encoding.ASCII replaces every character it cannot encode with ?, so the second file contains ?? instead of 你好 and is still pure ASCII.

 

Creating a file with the built-in Windows Notepad and saving it as ANSI

The first text file contains the Chinese text "你好":

2019-04-23-011.txt: ISO-8859 text, with no line terminators

The second text file contains the English text "hello":

2019-04-23-012.txt: ASCII text, with no line terminators
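The first file is reported as ISO-8859 text because its bytes are neither pure ASCII nor valid UTF-8. On a Simplified Chinese Windows system "ANSI" usually means the GBK code page, which would match the gb18030 guess from UDE earlier. One way to separate such legacy-encoded files from UTF-8 ones is to ask iconv to decode the file as UTF-8 and look at its exit status. This is a sketch under the assumption that the installed iconv rejects malformed UTF-8 input with a non-zero status. is_valid_utf8 is a hypothetical helper name.

```shell
# is_valid_utf8 is a hypothetical helper: it succeeds (exit status 0) only if
# the whole file can be decoded as UTF-8.
is_valid_utf8() {
    # Convert UTF-8 to UTF-8 and discard the output; only the status matters.
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}
```

A file for which is_valid_utf8 fails needs its real source encoding named explicitly before conversion, for example with iconv -f GBK -t UTF-8, provided the local iconv supports that encoding name.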

 

Further reading

Character Encoding in .NET

 


